CS4132 Data Analytics
Basketball is one of the most popular sports in the world. It is a team sport in which two opposing sides compete to shoot a ball through a hoop mounted 10 feet (3.05 m) above the ground. The sport has had a great cultural impact worldwide.
Throughout its history, basketball has stood out as a beacon of entertainment. The biggest stage for the sport has consistently been the NBA, to the point where the league has become its central authority. I wish to analyze how the sport has changed over the years, focusing on the NBA. It would be very interesting to see how viewership, players and tactics have fluctuated over the years, and how significant world events such as COVID-19 have impacted the sport.
We will be using several technical terms throughout this report, so I will define the most important ones here for clarity.
The five positions are known by unique names: point guard (PG), shooting guard (SG), small forward (SF), power forward (PF), and center (C).
In basketball, a free throw is an unopposed shot awarded when a foul is called. A player on the fouled team shoots free throws from the free throw line after an opponent commits a shooting foul.
A field goal in basketball is a basket scored with any shot or tap other than a free throw. A slam dunk is a particular kind of field goal in which a player jumps toward the basket and forces the ball through the hoop while in the air.
A three-pointer is a field goal worth three points, awarded for any shot made from beyond the three-point arc (in the diagram, behind the pink shaded region).
!pip install html5lib
!pip install gtab
!pip install plotly
!pip install chart_studio
!pip install mlxtend
import pandas as pd
import unidecode
from tqdm.notebook import tqdm
import plotly.graph_objects as go
import numpy as np
from bs4 import BeautifulSoup, Comment
import re
import chart_studio.plotly as py
import plotly.offline as po
import plotly.graph_objs as pg
import matplotlib.pyplot as plt
import requests
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy.optimize import curve_fit
from sklearn.linear_model import LogisticRegression
from plotly.colors import n_colors
from statsmodels.graphics import tsaplots
from importlib import reload
from mpl_toolkits.mplot3d import Axes3D
tqdm.pandas()
We first obtain the player details from the basketball-reference.com website. This gives us all the player names, heights, weights, birth dates, colleges, etc.
Players = pd.read_csv("Players.csv")
Players
| Unnamed: 0 | Player | From | To | Pos | Ht | Wt | Birth Date | Colleges | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | 240.0 | June 24, 1968 | Duke |
| 1 | 1 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State |
| 2 | 2 | Kareem Abdul-Jabbar* | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | UCLA |
| 3 | 3 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | LSU |
| 4 | 4 | Tariq Abdul-Wahad | 1998 | 2003 | F | 6-6 | 223.0 | November 3, 1974 | Michigan, San Jose State |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5018 | 5018 | Ante Žižić | 2018 | 2020 | F-C | 6-10 | 266.0 | January 4, 1997 | NaN |
| 5019 | 5019 | Jim Zoet | 1983 | 1983 | C | 7-1 | 240.0 | December 20, 1953 | Kent State University |
| 5020 | 5020 | Bill Zopf | 1971 | 1971 | G | 6-1 | 170.0 | June 7, 1948 | Duquesne |
| 5021 | 5021 | Ivica Zubac | 2017 | 2022 | C | 7-0 | 240.0 | March 18, 1997 | NaN |
| 5022 | 5022 | Matt Zunic | 1949 | 1949 | G-F | 6-3 | 195.0 | December 19, 1919 | George Washington |
5023 rows × 9 columns
salaryData = pd.read_csv("SalaryTeamData2.csv")
salaryData
| Unnamed: 0 | Season | Age | Tm | Lg | Pos | G | GS | MP | FG | ... | AST | STL | BLK | TOV | PF | PTS | Team | Salary | Unnamed: 30 | Trp Dbl | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1990-91 | 22.0 | POR | NBA | PF | 5.0 | 0.0 | 2.6 | 0.4 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.8 | Portland Trail Blazers | $395,000 | NaN | NaN |
| 1 | 1 | 1991-92 | 23.0 | POR | NBA | PF | 8.0 | 0.0 | 3.1 | 0.6 | ... | 0.3 | 0.0 | 0.0 | 0.3 | 0.5 | 1.5 | Portland Trail Blazers | $494,000 | NaN | NaN |
| 2 | 2 | 1992-93 | 24.0 | BOS | NBA | PF | 4.0 | 4.0 | 17.0 | 2.8 | ... | 0.3 | 0.0 | 0.3 | 2.3 | 1.8 | 5.5 | Boston Celtics | $500,000 | NaN | NaN |
| 3 | 0 | 1984-85 | 37.0 | LAL | NBA | C | 19.0 | 19.0 | 32.1 | 8.8 | ... | 4.0 | 1.2 | 1.9 | 2.7 | 3.5 | 21.9 | Los Angeles Lakers | $1,530,000 | NaN | NaN |
| 4 | 1 | 1985-86 | 38.0 | LAL | NBA | C | 14.0 | 14.0 | 34.9 | 11.2 | ... | 3.5 | 1.1 | 1.7 | 3.0 | 3.9 | 25.9 | Los Angeles Lakers | $2,030,000 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10759 | 3 | 2018-19 | 21.0 | LAL | NBA | C | 33.0 | 12.0 | 15.6 | 3.4 | ... | 0.8 | 0.1 | 0.8 | 1.0 | 2.2 | 8.5 | Los Angeles Clippers | $1,544,951 | NaN | NaN |
| 10760 | 4 | 2018-19 | 21.0 | LAC | NBA | C | 26.0 | 25.0 | 20.2 | 3.8 | ... | 1.5 | 0.4 | 0.9 | 1.4 | 2.5 | 9.4 | Los Angeles Clippers | $1,544,951 | NaN | NaN |
| 10761 | 5 | 2019-20 | 22.0 | LAC | NBA | C | 72.0 | 70.0 | 18.4 | 3.3 | ... | 1.1 | 0.2 | 0.9 | 0.8 | 2.3 | 8.3 | Los Angeles Clippers | $6,481,482 | NaN | NaN |
| 10762 | 6 | 2020-21 | 23.0 | LAC | NBA | C | 72.0 | 33.0 | 22.3 | 3.6 | ... | 1.3 | 0.3 | 0.9 | 1.1 | 2.6 | 9.0 | Los Angeles Clippers | $7,000,000 | NaN | NaN |
| 10763 | 7 | 2021-22 | 24.0 | LAC | NBA | C | 76.0 | 76.0 | 24.4 | 4.1 | ... | 1.6 | 0.5 | 1.0 | 1.5 | 2.7 | 10.3 | Los Angeles Clippers | $7,518,518 | NaN | NaN |
10764 rows × 35 columns
inflation_data = pd.read_csv("inflation_data.csv")
inflation_data
| year | amount | inflation rate | |
|---|---|---|---|
| 0 | 1800 | 1.00 | 0.02 |
| 1 | 1801 | 1.01 | 0.01 |
| 2 | 1802 | 0.85 | -0.16 |
| 3 | 1803 | 0.90 | 0.06 |
| 4 | 1804 | 0.94 | 0.04 |
| ... | ... | ... | ... |
| 218 | 2018 | 19.94 | 0.02 |
| 219 | 2019 | 20.29 | 0.02 |
| 220 | 2020 | 20.54 | 0.01 |
| 221 | 2021 | 21.51 | 0.05 |
| 222 | 2022 | 23.51 | 0.09 |
223 rows × 3 columns
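As a side note on how this table will be used: the "amount" column is a cumulative price level (1800 = 1.00), so a nominal salary from one year can be restated in another year's dollars by scaling with the ratio of the two levels. A minimal sketch, using the 2018 and 2022 levels from the table above:

```python
# Restate a nominal amount in another year's dollars using the cumulative
# price-level ("amount") column, where 1800 is the 1.00 baseline.
def adjust_for_inflation(salary, level_from, level_to):
    return salary * level_to / level_from

# Levels read off the inflation_data table: 2018 -> 19.94, 2022 -> 23.51
print(round(adjust_for_inflation(1_000_000, 19.94, 23.51)))  # 1179037
```

So a $1,000,000 salary in 2018 corresponds to roughly $1.18 million in 2022 dollars, which is how we will put salaries from different eras on a comparable footing.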
Later, when we analyze each player and visit their personal pages, the website gives us the full country name (or state name, for the US). However, we want alpha-3 codes to plot this, so we will use the following codes, obtained by web scraping.
country_codes_data = pd.read_csv("CountryCodes.csv")
country_codes_data
| Unnamed: 0 | COUNTRY | COUNTRY CODE | ISO CODES | |
|---|---|---|---|---|
| 0 | 0 | Afghanistan | 93 | AF / AFG |
| 1 | 1 | Albania | 355 | AL / ALB |
| 2 | 2 | Algeria | 213 | DZ / DZA |
| 3 | 3 | American Samoa | 1-684 | AS / ASM |
| 4 | 4 | Andorra | 376 | AD / AND |
| ... | ... | ... | ... | ... |
| 235 | 235 | Wallis and Futuna | 681 | WF / WLF |
| 236 | 236 | Western Sahara | 212 | EH / ESH |
| 237 | 237 | Yemen | 967 | YE / YEM |
| 238 | 238 | Zambia | 260 | ZM / ZMB |
| 239 | 239 | Zimbabwe | 263 | ZW / ZWE |
240 rows × 4 columns
state_codes_data = pd.read_csv("StateCodes.csv")
state_codes_data
| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | State | Postal Abbr. | FIPS Code | State | Postal Abbr. | FIPS Code |
| 1 | 1 | Alabama | AL | 01 | Nebraska | NE | 31 |
| 2 | 2 | Alaska | AK | 02 | Nevada | NV | 32 |
| 3 | 3 | Arizona | AZ | 04 | New Hampshire | NH | 33 |
| 4 | 4 | Arkansas | AR | 05 | New Jersey | NJ | 34 |
| 5 | 5 | California | CA | 06 | New Mexico | NM | 35 |
| 6 | 6 | Colorado | CO | 08 | New York | NY | 36 |
| 7 | 7 | Connecticut | CT | 09 | North Carolina | NC | 37 |
| 8 | 8 | Delaware | DE | 10 | North Dakota | ND | 38 |
| 9 | 9 | District of Columbia | DC | 11 | Ohio | OH | 39 |
| 10 | 10 | Florida | FL | 12 | Oklahoma | OK | 40 |
| 11 | 11 | Georgia | GA | 13 | Oregon | OR | 41 |
| 12 | 12 | Hawaii | HI | 15 | Pennsylvania | PA | 42 |
| 13 | 13 | Idaho | ID | 16 | Puerto Rico | PR | 72 |
| 14 | 14 | Illinois | IL | 17 | Rhode Island | RI | 44 |
| 15 | 15 | Indiana | IN | 18 | South Carolina | SC | 45 |
| 16 | 16 | Iowa | IA | 19 | South Dakota | SD | 46 |
| 17 | 17 | Kansas | KS | 20 | Tennessee | TN | 47 |
| 18 | 18 | Kentucky | KY | 21 | Texas | TX | 48 |
| 19 | 19 | Louisiana | LA | 22 | Utah | UT | 49 |
| 20 | 20 | Maine | ME | 23 | Vermont | VT | 50 |
| 21 | 21 | Maryland | MD | 24 | Virginia | VA | 51 |
| 22 | 22 | Massachusetts | MA | 25 | Virgin Islands | VI | 78 |
| 23 | 23 | Michigan | MI | 26 | Washington | WA | 53 |
| 24 | 24 | Minnesota | MN | 27 | West Virginia | WV | 54 |
| 25 | 25 | Mississippi | MS | 28 | Wisconsin | WI | 55 |
| 26 | 26 | Missouri | MO | 29 | Wyoming | WY | 56 |
| 27 | 27 | Montana | MT | 30 |  |  |  |
country_codes_dataa = pd.read_csv("CountryCodesThreeLetter.csv")
country_codes_dataa
| Unnamed: 0 | Country | Alpha-2 code | Alpha-3 code | Numeric | |
|---|---|---|---|---|---|
| 0 | 0 | Afghanistan | AF | AFG | 4 |
| 1 | 1 | Åland Islands | AX | ALA | 248 |
| 2 | 2 | Albania | AL | ALB | 8 |
| 3 | 3 | Algeria | DZ | DZA | 12 |
| 4 | 4 | American Samoa | AS | ASM | 16 |
| ... | ... | ... | ... | ... | ... |
| 244 | 244 | Wallis and Futuna | WF | WLF | 876 |
| 245 | 245 | Western Sahara | EH | ESH | 732 |
| 246 | 246 | Yemen | YE | YEM | 887 |
| 247 | 247 | Zambia | ZM | ZMB | 894 |
| 248 | 248 | Zimbabwe | ZW | ZWE | 716 |
249 rows × 5 columns
Since NBA viewership figures are not made public, we gauge interest over time indirectly using Google Trends data.
BasketBallTrends = pd.read_csv("multiTimeline.csv")
BasketBallTrends
| Category: All categories | |
|---|---|
| Month | Basketball: (Worldwide) |
| 2004-01 | 47 |
| 2004-02 | 48 |
| 2004-03 | 67 |
| 2004-04 | 34 |
| ... | ... |
| 2022-05 | 32 |
| 2022-06 | 32 |
| 2022-07 | 28 |
| 2022-08 | 28 |
| 2022-09 | 44 |
226 rows × 1 columns
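The odd single-column rendering above comes from the Google Trends CSV layout: the export starts with a category line and a blank line before the real header. A sketch of a cleaner parse, using a small inline sample rather than the actual multiTimeline.csv:

```python
import io
import pandas as pd

# A miniature stand-in for the multiTimeline.csv layout: a category line,
# a blank line, then the real header and monthly data.
sample = """Category: All categories

Month,Basketball: (Worldwide)
2004-01,47
2004-02,48
2004-03,67
"""

# skiprows=2 drops the category line and the blank line, so the real
# header row is used for the column names.
trends = pd.read_csv(io.StringIO(sample), skiprows=2)
trends.columns = ["Month", "Interest"]
print(trends["Interest"].tolist())  # [47, 48, 67]
```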
geo = pd.read_csv("geoMap (1).csv")
geo
| Category: All categories | |
|---|---|
| Country | Basketball: (01/01/2004 - 20/09/2022) |
| Lithuania | 100 |
| United States | 56 |
| Montenegro | 56 |
| Marshall Islands | 55 |
| ... | ... |
| Tokelau | NaN |
| Tuvalu | NaN |
| US Outlying Islands | NaN |
| Vatican City | NaN |
| Wallis & Futuna | NaN |
251 rows × 1 columns
details = pd.read_csv('games_details.csv')
details
C:\Users\admin\AppData\Local\Temp\ipykernel_25424\2901751251.py:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
| GAME_ID | TEAM_ID | TEAM_ABBREVIATION | TEAM_CITY | PLAYER_ID | PLAYER_NAME | NICKNAME | START_POSITION | COMMENT | MIN | ... | OREB | DREB | REB | AST | STL | BLK | TO | PF | PTS | PLUS_MINUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22101005 | 1610612750 | MIN | Minnesota | 1630162 | Anthony Edwards | Anthony | F | NaN | 36:22 | ... | 0.0 | 8.0 | 8.0 | 5.0 | 3.0 | 1.0 | 1.0 | 1.0 | 15.0 | 5.0 |
| 1 | 22101005 | 1610612750 | MIN | Minnesota | 1630183 | Jaden McDaniels | Jaden | F | NaN | 23:54 | ... | 2.0 | 4.0 | 6.0 | 0.0 | 0.0 | 2.0 | 2.0 | 6.0 | 14.0 | 10.0 |
| 2 | 22101005 | 1610612750 | MIN | Minnesota | 1626157 | Karl-Anthony Towns | Karl-Anthony | C | NaN | 25:17 | ... | 1.0 | 9.0 | 10.0 | 0.0 | 0.0 | 0.0 | 3.0 | 4.0 | 15.0 | 14.0 |
| 3 | 22101005 | 1610612750 | MIN | Minnesota | 1627736 | Malik Beasley | Malik | G | NaN | 30:52 | ... | 0.0 | 3.0 | 3.0 | 1.0 | 1.0 | 0.0 | 1.0 | 4.0 | 12.0 | 20.0 |
| 4 | 22101005 | 1610612750 | MIN | Minnesota | 1626156 | D'Angelo Russell | D'Angelo | G | NaN | 33:46 | ... | 0.0 | 6.0 | 6.0 | 9.0 | 1.0 | 0.0 | 5.0 | 0.0 | 14.0 | 17.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 645948 | 11200005 | 1610612743 | DEN | Denver | 202706 | Jordan Hamilton | NaN | NaN | NaN | 19 | ... | 0.0 | 2.0 | 2.0 | 0.0 | 2.0 | 0.0 | 1.0 | 3.0 | 17.0 | NaN |
| 645949 | 11200005 | 1610612743 | DEN | Denver | 202702 | Kenneth Faried | NaN | NaN | NaN | 23 | ... | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 3.0 | 18.0 | NaN |
| 645950 | 11200005 | 1610612743 | DEN | Denver | 201585 | Kosta Koufos | NaN | NaN | NaN | 15 | ... | 3.0 | 5.0 | 8.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.0 | 6.0 | NaN |
| 645951 | 11200005 | 1610612743 | DEN | Denver | 202389 | Timofey Mozgov | NaN | NaN | NaN | 19 | ... | 1.0 | 2.0 | 3.0 | 1.0 | 0.0 | 0.0 | 4.0 | 2.0 | 2.0 | NaN |
| 645952 | 11200005 | 1610612743 | DEN | Denver | 201951 | Ty Lawson | NaN | NaN | NaN | 27 | ... | 0.0 | 2.0 | 2.0 | 6.0 | 2.0 | 0.0 | 6.0 | 1.0 | 8.0 | NaN |
645953 rows × 29 columns
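The DtypeWarning raised above happens because a column mixes types during pandas' chunked inference (MIN, for instance, holds both "36:22" and "19" in the rows shown). Specifying a dtype up front avoids it; a sketch on toy data, since the exact column mapping here is illustrative:

```python
import io
import pandas as pd

# Toy data mimicking a mixed-type column like MIN in games_details.csv,
# which holds both "36:22" (minutes:seconds) and plain "19".
sample = "PLAYER_NAME,MIN\nAnthony Edwards,36:22\nJordan Hamilton,19\n"

# Forcing the column to str prevents the per-chunk dtype inference that
# triggers DtypeWarning on large files (low_memory=False also works).
df = pd.read_csv(io.StringIO(sample), dtype={"MIN": str})
print(df["MIN"].tolist())  # ['36:22', '19']
```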
games = pd.read_csv('games.csv')[["GAME_ID","SEASON"]]
games
| GAME_ID | SEASON | |
|---|---|---|
| 0 | 22101005 | 2021 |
| 1 | 22101006 | 2021 |
| 2 | 22101007 | 2021 |
| 3 | 22101008 | 2021 |
| 4 | 22101009 | 2021 |
| ... | ... | ... |
| 25791 | 11400007 | 2014 |
| 25792 | 11400004 | 2014 |
| 25793 | 11400005 | 2014 |
| 25794 | 11400002 | 2014 |
| 25795 | 11400001 | 2014 |
25796 rows × 2 columns
The country code and state code tables contain columns we do not need, such as the FIPS codes and serial numbers, so we will remove them.
country_code = (pd.DataFrame([country_codes_data["COUNTRY"],country_codes_data["ISO CODES"].str.split(" / ").str[0]])).T
# Note: the US stays in this table; US players are matched per state below
country_code
| COUNTRY | ISO CODES | |
|---|---|---|
| 0 | Afghanistan | AF |
| 1 | Albania | AL |
| 2 | Algeria | DZ |
| 3 | American Samoa | AS |
| 4 | Andorra | AD |
| ... | ... | ... |
| 235 | Wallis and Futuna | WF |
| 236 | Western Sahara | EH |
| 237 | Yemen | YE |
| 238 | Zambia | ZM |
| 239 | Zimbabwe | ZW |
240 rows × 2 columns
state_codes_data = pd.read_csv("StateCodes.csv")
# The source table lists the states in two side-by-side halves; stack the
# right half (columns 3-5) under the left half (columns 0-2)
state_codes_dat = pd.concat([state_codes_data.loc[1:,:'2'],state_codes_data.loc[1:,'3':].rename(columns={'3':'0','4':'1','5':'2'})], ignore_index=True)
state_codes_dat = state_codes_dat.loc[:52,'0':'1']  # keep only state name and postal abbreviation
state_code = (pd.DataFrame([state_codes_dat['0'],state_codes_dat['1']])).T
state_code.columns = ["Country","Alpha code"]
state_code["AlphaThree Code"] = "USA"  # every state maps to the USA alpha-3 code
state_code
state_code
| Country | Alpha code | AlphaThree Code | |
|---|---|---|---|
| 0 | Alabama | AL | USA |
| 1 | Alaska | AK | USA |
| 2 | Arizona | AZ | USA |
| 3 | Arkansas | AR | USA |
| 4 | California | CA | USA |
| 5 | Colorado | CO | USA |
| 6 | Connecticut | CT | USA |
| 7 | Delaware | DE | USA |
| 8 | District of Columbia | DC | USA |
| 9 | Florida | FL | USA |
| 10 | Georgia | GA | USA |
| 11 | Hawaii | HI | USA |
| 12 | Idaho | ID | USA |
| 13 | Illinois | IL | USA |
| 14 | Indiana | IN | USA |
| 15 | Iowa | IA | USA |
| 16 | Kansas | KS | USA |
| 17 | Kentucky | KY | USA |
| 18 | Louisiana | LA | USA |
| 19 | Maine | ME | USA |
| 20 | Maryland | MD | USA |
| 21 | Massachusetts | MA | USA |
| 22 | Michigan | MI | USA |
| 23 | Minnesota | MN | USA |
| 24 | Mississippi | MS | USA |
| 25 | Missouri | MO | USA |
| 26 | Montana | MT | USA |
| 27 | Nebraska | NE | USA |
| 28 | Nevada | NV | USA |
| 29 | New Hampshire | NH | USA |
| 30 | New Jersey | NJ | USA |
| 31 | New Mexico | NM | USA |
| 32 | New York | NY | USA |
| 33 | North Carolina | NC | USA |
| 34 | North Dakota | ND | USA |
| 35 | Ohio | OH | USA |
| 36 | Oklahoma | OK | USA |
| 37 | Oregon | OR | USA |
| 38 | Pennsylvania | PA | USA |
| 39 | Puerto Rico | PR | USA |
| 40 | Rhode Island | RI | USA |
| 41 | South Carolina | SC | USA |
| 42 | South Dakota | SD | USA |
| 43 | Tennessee | TN | USA |
| 44 | Texas | TX | USA |
| 45 | Utah | UT | USA |
| 46 | Vermont | VT | USA |
| 47 | Virginia | VA | USA |
| 48 | Virgin Islands | VI | USA |
| 49 | Washington | WA | USA |
| 50 | West Virginia | WV | USA |
| 51 | Wisconsin | WI | USA |
| 52 | Wyoming | WY | USA |
country_code_alpha = (pd.DataFrame([country_codes_dataa["Alpha-2 code"],country_codes_dataa["Alpha-3 code"]])).T
country_code_alpha
| Alpha-2 code | Alpha-3 code | |
|---|---|---|
| 0 | AF | AFG |
| 1 | AX | ALA |
| 2 | AL | ALB |
| 3 | DZ | DZA |
| 4 | AS | ASM |
| ... | ... | ... |
| 244 | WF | WLF |
| 245 | EH | ESH |
| 246 | YE | YEM |
| 247 | ZM | ZMB |
| 248 | ZW | ZWE |
249 rows × 2 columns
The names obtained from the website must be transliterated to plain ASCII to be usable; accented letters make matching much harder down the line. Hence we apply the following function to decode them all.
def personalInfo(row):
    row.Player = unidecode.unidecode(row.Player)
    return row
Players.apply(personalInfo,axis="columns")
| Unnamed: 0 | Player | From | To | Pos | Ht | Wt | Birth Date | Colleges | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | 240.0 | June 24, 1968 | Duke |
| 1 | 1 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State |
| 2 | 2 | Kareem Abdul-Jabbar* | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | UCLA |
| 3 | 3 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | LSU |
| 4 | 4 | Tariq Abdul-Wahad | 1998 | 2003 | F | 6-6 | 223.0 | November 3, 1974 | Michigan, San Jose State |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5018 | 5018 | Ante Zizic | 2018 | 2020 | F-C | 6-10 | 266.0 | January 4, 1997 | NaN |
| 5019 | 5019 | Jim Zoet | 1983 | 1983 | C | 7-1 | 240.0 | December 20, 1953 | Kent State University |
| 5020 | 5020 | Bill Zopf | 1971 | 1971 | G | 6-1 | 170.0 | June 7, 1948 | Duquesne |
| 5021 | 5021 | Ivica Zubac | 2017 | 2022 | C | 7-0 | 240.0 | March 18, 1997 | NaN |
| 5022 | 5022 | Matt Zunic | 1949 | 1949 | G-F | 6-3 | 195.0 | December 19, 1919 | George Washington |
5023 rows × 9 columns
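Here unidecode does the transliteration. For reference only, a similar effect for accented Latin names can be had with the standard library's unicodedata, by decomposing each character and dropping the combining marks (a sketch; the notebook itself uses unidecode):

```python
import unicodedata

def strip_accents(text):
    # Decompose characters (NFKD), then drop the combining marks,
    # leaving the base ASCII letters behind.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Ante Žižić"))    # Ante Zizic
print(strip_accents("Óscar Torres"))  # Oscar Torres
```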
To obtain the country (and, for the US, the state) of each individual player, we scrape each birthplace page and match the listed players back to our Players dataframe.
currentcode = ""
def labelUsers(row):
    global currentcode
    global Players
    # Tag every player listed on the current birthplace page
    Players.loc[Players["Player"]==row["Player"], "Country"] = currentcode
def findPlayersnotUS(row):
    global currentcode
    try:
        if row["ISO CODES"] == "US":
            return  # US players are handled per state in findPlayersUS
        web = "https://www.basketball-reference.com/friv/birthplaces.fcgi?country="+row["ISO CODES"]+"&state="
        url = requests.get(web)
        data = pd.read_html(url.text)
        currentcode = row["ISO CODES"]
        data[-1].columns = data[-1].columns.droplevel(0)
        data[-1].insert(2,"Country",[row["ISO CODES"] for i in range(len(data[-1]))])
        data[-1].apply(labelUsers,axis="columns")
    except (ValueError, TypeError):
        pass
def findPlayersUS(row):
    global currentcode
    web = "https://www.basketball-reference.com/friv/birthplaces.fcgi?country=US&state="+row["Alpha code"]
    url = requests.get(web)
    try:
        data = pd.read_html(url.text)
        currentcode = row["Alpha code"]
        data[-1].columns = data[-1].columns.droplevel(0)
        data[-1].insert(2,"Country",[row["Alpha code"] for i in range(len(data[-1]))])
        data[-1].apply(labelUsers,axis="columns")
    except ValueError:
        pass
Players.insert(len(Players.columns),"Country",["" for i in range(len(Players.Player))])
Players
| Unnamed: 0 | Player | From | To | Pos | Ht | Wt | Birth Date | Colleges | Country | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | 240.0 | June 24, 1968 | Duke | |
| 1 | 1 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State | |
| 2 | 2 | Kareem Abdul-Jabbar* | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | UCLA | |
| 3 | 3 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | LSU | |
| 4 | 4 | Tariq Abdul-Wahad | 1998 | 2003 | F | 6-6 | 223.0 | November 3, 1974 | Michigan, San Jose State | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5018 | 5018 | Ante Žižić | 2018 | 2020 | F-C | 6-10 | 266.0 | January 4, 1997 | NaN | |
| 5019 | 5019 | Jim Zoet | 1983 | 1983 | C | 7-1 | 240.0 | December 20, 1953 | Kent State University | |
| 5020 | 5020 | Bill Zopf | 1971 | 1971 | G | 6-1 | 170.0 | June 7, 1948 | Duquesne | |
| 5021 | 5021 | Ivica Zubac | 2017 | 2022 | C | 7-0 | 240.0 | March 18, 1997 | NaN | |
| 5022 | 5022 | Matt Zunic | 1949 | 1949 | G-F | 6-3 | 195.0 | December 19, 1919 | George Washington |
5023 rows × 10 columns
country_code.progress_apply(findPlayersnotUS,axis="columns")
state_code.progress_apply(findPlayersUS,axis="columns")
Now we remove the duplicate indexes in our Players dataframe. We will also convert the Height data (which is in feet/inches) to centimeters.
Players = Players.loc[~Players.index.duplicated(), :]
# Ht is in "feet-inches" form (e.g. "6-10"); split on the hyphen and
# convert to centimeters (1 foot = 12 inches, 1 inch = 2.54 cm)
ht = Players["Ht"].str.split("-", expand=True)
Players["HtCm"] = (12*pd.to_numeric(ht[0]) + pd.to_numeric(ht[1]))*2.54
Players
| Unnamed: 0 | Player | From | To | Pos | Ht | Wt | Birth Date | Colleges | Country | HtCm | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | 240.0 | June 24, 1968 | Duke | EG | 208.28 |
| 1 | 1 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State | NY | 205.74 |
| 2 | 2 | Kareem Abdul-Jabbar* | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | UCLA | NY | 218.44 |
| 3 | 3 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | LSU | MS | 185.42 |
| 4 | 4 | Tariq Abdul-Wahad | 1998 | 2003 | F | 6-6 | 223.0 | November 3, 1974 | Michigan, San Jose State | FR | 198.12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5018 | 5018 | Ante Žižić | 2018 | 2020 | F-C | 6-10 | 266.0 | January 4, 1997 | NaN | HR | 208.28 |
| 5019 | 5019 | Jim Zoet | 1983 | 1983 | C | 7-1 | 240.0 | December 20, 1953 | Kent State University | CA | 215.90 |
| 5020 | 5020 | Bill Zopf | 1971 | 1971 | G | 6-1 | 170.0 | June 7, 1948 | Duquesne | 185.42 | |
| 5021 | 5021 | Ivica Zubac | 2017 | 2022 | C | 7-0 | 240.0 | March 18, 1997 | NaN | BA | 213.36 |
| 5022 | 5022 | Matt Zunic | 1949 | 1949 | G-F | 6-3 | 195.0 | December 19, 1919 | George Washington | PA | 190.50 |
5023 rows × 11 columns
We can now attach the alpha-3 country codes to each player, so that we can plot them later on.
Players = pd.merge(Players.copy(),country_code_alpha,left_on="Country",right_on="Alpha-2 code",how="outer")
Players = pd.merge(Players.copy(),state_code,left_on="Country",right_on="Alpha code",how="outer")
Players["AlphaThree Code"].fillna(Players["Alpha-3 code"],inplace=True)
Players["Alpha-3 code"] = Players["AlphaThree Code"]
Players=Players[~Players["Player"].isnull()]
Players
| Unnamed: 0 | Player | From | To | Pos | Ht | Wt | Birth Date | Colleges | Country_x | HtCm | Alpha-2 code | Alpha-3 code | Country_y | Alpha code | AlphaThree Code | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | Alaa Abdelnaby | 1991.0 | 1995.0 | F-C | 6-10 | 240.0 | June 24, 1968 | Duke | EG | 208.28 | EG | EGY | NaN | NaN | EGY |
| 1 | 3209.0 | Abdel Nader | 2018.0 | 2022.0 | F | 6-5 | 225.0 | September 25, 1993 | Northern Illinois, Iowa State | EG | 195.58 | EG | EGY | NaN | NaN | EGY |
| 2 | 1.0 | Zaid Abdul-Aziz | 1969.0 | 1978.0 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State | NY | 205.74 | NaN | USA | New York | NY | USA |
| 3 | 2.0 | Kareem Abdul-Jabbar* | 1970.0 | 1989.0 | C | 7-2 | 225.0 | April 16, 1947 | UCLA | NY | 218.44 | NaN | USA | New York | NY | USA |
| 4 | 12.0 | Don Ackerman | 1954.0 | 1954.0 | G | 6-0 | 183.0 | September 4, 1930 | Long Island University | NY | 182.88 | NaN | USA | New York | NY | USA |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5018 | 4009.0 | Ha Seung-Jin | 2005.0 | 2006.0 | C | 7-3 | 305.0 | August 4, 1985 | NaN | KR | 220.98 | KR | KOR | NaN | NaN | KOR |
| 5019 | 4338.0 | Edy Tavares | 2016.0 | 2017.0 | C | 7-3 | 260.0 | March 22, 1992 | NaN | CV | 220.98 | CV | CPV | NaN | NaN | CPV |
| 5020 | 4375.0 | Hasheem Thabeet | 2010.0 | 2014.0 | C | 7-3 | 263.0 | February 16, 1987 | UConn | TZ | 220.98 | TZ | TZA | NaN | NaN | TZA |
| 5021 | 4469.0 | Óscar Torres | 2002.0 | 2003.0 | F | 6-6 | 210.0 | December 18, 1976 | NaN | VE | 198.12 | VE | VEN | NaN | NaN | VEN |
| 5022 | 4560.0 | Greivis Vásquez | 2011.0 | 2017.0 | G | 6-6 | 217.0 | January 16, 1987 | Maryland | VE | 198.12 | VE | VEN | NaN | NaN | VEN |
5023 rows × 16 columns
Now, to process the map dataset obtained from Google Trends, we rename the columns, weight each country's relative score by its number of internet users, and convert the country names to alpha-3 codes.
new_header = geo.iloc[0]
geo = geo[1:]
geo.columns = new_header
geo = geo.loc[geo["Basketball: (01/01/2004 - 20/09/2022)"].notnull()]
geo = pd.merge(geo,pd.read_csv("List of Countries by number of Internet Users - Sheet1.csv"),left_index=True,right_on="Country or Area",how="inner")
geo["Basketball: (01/01/2004 - 20/09/2022)"] = pd.to_numeric(geo["Basketball: (01/01/2004 - 20/09/2022)"] )* pd.to_numeric(geo["Internet Users"].str.replace(",",""))
geo = pd.merge(geo,country_code,left_on="Country or Area",right_on="COUNTRY")
geo = pd.merge(geo,country_code_alpha,left_on="ISO CODES",right_on="Alpha-2 code")
geo["Basketball: (01/01/2004 - 20/09/2022)"] = geo["Basketball: (01/01/2004 - 20/09/2022)"]/geo['Basketball: (01/01/2004 - 20/09/2022)'].max()
geo
| Basketball: (01/01/2004 - 20/09/2022) | Country or Area | Internet Users | Population | Rank | Percentage | Rank.1 | COUNTRY | ISO CODES | Alpha-2 code | Alpha-3 code | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.016413 | Lithuania | 2,243,448 | 2,890,297 | 115 | 77.62% | 58 | Lithuania | LT | LT | LTU |
| 1 | 1.000000 | United States | 244,090,854 | 324,459,463 | 3 | 75.23% | 68 | United States | US | US | USA |
| 2 | 0.001836 | Montenegro | 448,260 | 628,960 | 154 | 71.27% | 75 | Montenegro | ME | ME | MNE |
| 3 | 0.000083 | Marshall Islands | 20,560 | 53,127 | 203 | 38.70% | 138 | Marshall Islands | MH | MH | MHL |
| 4 | 0.029671 | Greece | 7,799,565 | 11,159,773 | 58 | 69.89% | 77 | Greece | GR | GR | GRC |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 179 | 0.067502 | India | 461,347,554 | 1,339,180,127 | 2 | 34.45% | 145 | India | IN | IN | IND |
| 180 | 0.001831 | Sudan | 12,512,639 | 40,533,330 | 46 | 30.87% | 153 | Sudan | SD | SD | SDN |
| 181 | 0.001104 | Yemen | 7,548,512 | 28,250,420 | 62 | 26.72% | 164 | Yemen | YE | YE | YEM |
| 182 | 0.009048 | Pakistan | 61,837,331 | 220,800,300 | 25 | 30.68% | 184 | Pakistan | PK | PK | PAK |
| 183 | 0.004467 | Bangladesh | 30,530,435 | 164,669,751 | 27 | 18.02% | 180 | Bangladesh | BD | BD | BGD |
184 rows × 11 columns
We process the salary data, stripping the currency symbol and thousands separators and converting the salary and season columns to numeric.
salaryData["Salary"] = salaryData["Salary"].str.replace(",","").str[1:]
salaryData["Salary"] = pd.to_numeric(salaryData["Salary"], errors="coerce")
salaryData["Season"] = pd.to_numeric(salaryData["Season"].str[:4])
salaryData
| Unnamed: 0 | Season | Age | Tm | Lg | Pos | G | GS | MP | FG | ... | AST | STL | BLK | TOV | PF | PTS | Team | Salary | Unnamed: 30 | Trp Dbl | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1990 | 22.0 | POR | NBA | PF | 5.0 | 0.0 | 2.6 | 0.4 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.8 | Portland Trail Blazers | 395000.0 | NaN | NaN |
| 1 | 1 | 1991 | 23.0 | POR | NBA | PF | 8.0 | 0.0 | 3.1 | 0.6 | ... | 0.3 | 0.0 | 0.0 | 0.3 | 0.5 | 1.5 | Portland Trail Blazers | 494000.0 | NaN | NaN |
| 2 | 2 | 1992 | 24.0 | BOS | NBA | PF | 4.0 | 4.0 | 17.0 | 2.8 | ... | 0.3 | 0.0 | 0.3 | 2.3 | 1.8 | 5.5 | Boston Celtics | 500000.0 | NaN | NaN |
| 3 | 0 | 1984 | 37.0 | LAL | NBA | C | 19.0 | 19.0 | 32.1 | 8.8 | ... | 4.0 | 1.2 | 1.9 | 2.7 | 3.5 | 21.9 | Los Angeles Lakers | 1530000.0 | NaN | NaN |
| 4 | 1 | 1985 | 38.0 | LAL | NBA | C | 14.0 | 14.0 | 34.9 | 11.2 | ... | 3.5 | 1.1 | 1.7 | 3.0 | 3.9 | 25.9 | Los Angeles Lakers | 2030000.0 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10759 | 3 | 2018 | 21.0 | LAL | NBA | C | 33.0 | 12.0 | 15.6 | 3.4 | ... | 0.8 | 0.1 | 0.8 | 1.0 | 2.2 | 8.5 | Los Angeles Clippers | 1544951.0 | NaN | NaN |
| 10760 | 4 | 2018 | 21.0 | LAC | NBA | C | 26.0 | 25.0 | 20.2 | 3.8 | ... | 1.5 | 0.4 | 0.9 | 1.4 | 2.5 | 9.4 | Los Angeles Clippers | 1544951.0 | NaN | NaN |
| 10761 | 5 | 2019 | 22.0 | LAC | NBA | C | 72.0 | 70.0 | 18.4 | 3.3 | ... | 1.1 | 0.2 | 0.9 | 0.8 | 2.3 | 8.3 | Los Angeles Clippers | 6481482.0 | NaN | NaN |
| 10762 | 6 | 2020 | 23.0 | LAC | NBA | C | 72.0 | 33.0 | 22.3 | 3.6 | ... | 1.3 | 0.3 | 0.9 | 1.1 | 2.6 | 9.0 | Los Angeles Clippers | 7000000.0 | NaN | NaN |
| 10763 | 7 | 2021 | 24.0 | LAC | NBA | C | 76.0 | 76.0 | 24.4 | 4.1 | ... | 1.6 | 0.5 | 1.0 | 1.5 | 2.7 | 10.3 | Los Angeles Clippers | 7518518.0 | NaN | NaN |
10764 rows × 35 columns
To account for inflation, we load the inflation data and normalize each year's price level relative to 2022.
inflation_data.index = inflation_data["year"]
inflation_data = inflation_data.drop("year",axis=1)
inflation_data = inflation_data.loc[1984:]
inflation_data = inflation_data["amount"]
inflation_data = inflation_data/inflation_data.loc[2022]
inflation_data
year
1984    0.350915
1985    0.363250
1986    0.370055
1987    0.383667
1988    0.399405
1989    0.418545
1990    0.441089
1991    0.459804
1992    0.473416
1993    0.487877
1994    0.500213
1995    0.514675
1996    0.529562
1997    0.541897
1998    0.550404
1999    0.562314
2000    0.581455
2001    0.598043
2002    0.607401
2003    0.621012
2004    0.637601
2005    0.659294
2006    0.680561
2007    0.700128
2008    0.726925
2009    0.724373
2010    0.736282
2011    0.759251
2012    0.774989
2013    0.786474
2014    0.799234
2015    0.800085
2016    0.810293
2017    0.827308
2018    0.848150
2019    0.863037
2020    0.873671
2021    0.914930
2022    1.000000
Name: amount, dtype: float64
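The report computes these factors but does not show the adjustment step itself; here is a minimal sketch, using a few hypothetical rows, of how salaries could be expressed in 2022 dollars (the real inputs are salaryData and inflation_data as above):

```python
import pandas as pd

# Hypothetical inflation factors (fraction of 2022 price level), as computed above
inflation = pd.Series({1990: 0.441089, 2021: 0.914930, 2022: 1.0}, name="amount")

# Toy salary rows; the real salaryData has a numeric "Season" column
salaries = pd.DataFrame({"Season": [1990, 2021], "Salary": [395000.0, 7518518.0]})

# Divide each salary by the factor for its season to express it in 2022 dollars
salaries["Salary2022"] = salaries["Salary"] / salaries["Season"].map(inflation)
print(salaries)
```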
We will now process the games data, aggregating player box scores into team totals and recomputing shooting percentages.
details.drop(["GAME_ID","TEAM_ID","PLAYER_ID"],axis=1)
details.drop_duplicates(subset=["GAME_ID","PLAYER_ID"],keep="first",inplace=True)
details = details.groupby(["GAME_ID","TEAM_ID"]).sum()
details = details.reset_index()
details = details.drop(['PLAYER_ID', 'FG_PCT','FG3_PCT','FT_PCT','PLUS_MINUS'],axis=1)
details["FT_PCT"] = details["FTM"]/details["FTA"]*100
details["FG3_PCT"] = details["FG3M"]/details["FG3A"]*100
details["FG_PCT"] = details["FGM"]/details["FGA"]*100
details = details.sort_values("GAME_ID") # groupby already sorts by GAME_ID; this is an extra safeguard so each game's two team rows stay adjacent
details["VICTORY"] = ""
for i in range(0,len(details)-1,2):
    if details["PTS"][i] < details["PTS"][i+1]:
        details.loc[i, "VICTORY"] = "Loss"
        details.loc[i+1, "VICTORY"] = "Win"
    else:
        details.loc[i, "VICTORY"] = "Win"
        details.loc[i+1, "VICTORY"] = "Loss"
details = pd.merge(details,games,how="left",on="GAME_ID")
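The pairwise loop above relies on each game's two team rows sitting at consecutive index positions. As an alternative, a groupby-based sketch (shown on toy rows with the GAME_ID and PTS columns assumed from the source) labels the winner per game independently of row order:

```python
import pandas as pd
import numpy as np

# Toy per-team game totals: two rows per GAME_ID, as in `details`
details = pd.DataFrame({
    "GAME_ID": [1, 1, 2, 2],
    "TEAM_ID": [10, 20, 10, 30],
    "PTS": [101, 99, 95, 110],
})

# The team with the higher score in each game wins; this does not depend on
# the two rows of a game being adjacent (NBA games cannot end in a tie)
details["VICTORY"] = np.where(
    details["PTS"] == details.groupby("GAME_ID")["PTS"].transform("max"),
    "Win", "Loss")
print(details["VICTORY"].tolist())  # ['Win', 'Loss', 'Loss', 'Win']
```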
Let's first analyze the physical characteristics of the players over time.
HeightHistogram = sns.histplot(Players["HtCm"])
HeightHistogram.set(xlabel="Height of players (cm)")
[Text(0.5, 0, 'Height of players (cm)')]
The gaps between bars arise because heights are recorded in whole inches, producing 2.54 cm steps between measurements. We observe that the distribution is slightly left skewed.
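For reference, the HtCm column could be derived from the feet-inches strings in Ht with a conversion like the following (a sketch assuming the "6-5" format throughout; one inch is 2.54 cm, which is what produces the gaps):

```python
def height_to_cm(ht: str) -> float:
    """Convert a 'feet-inches' string such as '6-5' to centimeters."""
    feet, inches = ht.split("-")
    return round((int(feet) * 12 + int(inches)) * 2.54, 2)

print(height_to_cm("6-1"))  # 185.42, matching the HtCm values in the table
```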
Now, in the original Players dataframe we are only given each player's entry and exit years, so to recover which players were active in each individual year, we iterate through every From-To pair and create a separate row for each year.
PlayersByYears = Players.copy()
decompose = lambda x: [i for i in range(int(x["From"]),int(x["To"]))]
PlayersByYears["Years"] = PlayersByYears.apply(decompose,axis=1)
PlayersByYears = PlayersByYears.explode("Years")
PlayersByYears.sort_values(by="Years",inplace=True)
PlayersByYears=PlayersByYears[PlayersByYears["Years"].notnull()]
PlayersByYears["YearsNum"] = pd.to_numeric(PlayersByYears["Years"])
PlayersByYears['YearsNum'] = pd.to_numeric(PlayersByYears['YearsNum'])
PlayersByYears["bins"] = pd.cut(PlayersByYears['YearsNum'], bins=np.arange(1945,2026,5),labels = [str(1945+int(i)*5)+"-"+str(1945+(int(i)+1)*5) for i in range(-1+len(np.arange(1945,2026,5)))])
PlayersByYears = PlayersByYears.loc[~PlayersByYears.index.duplicated(), :]
PlayersByYears
| Unnamed: 0 | Player | From | To | Pos | Ht | Wt | Birth Date | Colleges | Country_x | HtCm | Alpha-2 code | Alpha-3 code | Country_y | Alpha code | AlphaThree Code | Years | YearsNum | bins | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1297 | 3295.0 | Stan Noszka | 1947.0 | 1949.0 | G | 6-1 | 185.0 | September 9, 1920 | Duquesne | PA | 185.42 | PA | USA | Pennsylvania | PA | USA | 1947 | 1947 | 1945-1950 |
| 3095 | 3855.0 | Giff Roux | 1947.0 | 1949.0 | F-C | 6-5 | 195.0 | June 28, 1923 | Kansas | MO | 195.58 | MO | USA | Missouri | MO | USA | 1947 | 1947 | 1945-1950 |
| 278 | 3294.0 | George Nostrand | 1947.0 | 1950.0 | C-F | 6-8 | 195.0 | January 25, 1924 | Wyoming | NY | 203.20 | NaN | USA | New York | NY | USA | 1947 | 1947 | 1945-1950 |
| 2009 | 3127.0 | Elmore Morgenthaler | 1947.0 | 1949.0 | C | 6-9 | 230.0 | August 3, 1922 | New Mexico Tech, Boston College | TX | 205.74 | NaN | USA | Texas | TX | USA | 1947 | 1947 | 1945-1950 |
| 2463 | 2758.0 | John Mahnken | 1947.0 | 1953.0 | C | 6-8 | 220.0 | June 16, 1922 | Georgetown | NJ | 203.20 | NaN | USA | New Jersey | NJ | USA | 1947 | 1947 | 1945-1950 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 171 | 2050.0 | Elijah Hughes | 2021.0 | 2022.0 | F | 6-5 | 215.0 | March 10, 1998 | Syracuse, East Carolina University | NY | 195.58 | NaN | USA | New York | NY | USA | 2021 | 2021 | 2020-2025 |
| 1978 | 2315.0 | Mason Jones | 2021.0 | 2022.0 | G | 6-4 | 200.0 | July 21, 1998 | Connors State College, Arkansas | TX | 193.04 | NaN | USA | Texas | TX | USA | 2021 | 2021 | 2020-2025 |
| 1434 | 200.0 | LaMelo Ball | 2021.0 | 2022.0 | G | 6-7 | 180.0 | August 22, 2001 | NaN | CA | 200.66 | CA | USA | California | CA | USA | 2021 | 2021 | 2020-2025 |
| 3421 | 4146.0 | Jalen Smith | 2021.0 | 2022.0 | F | 6-10 | 215.0 | March 16, 2000 | Maryland | VA | 208.28 | VA | USA | Virginia | VA | USA | 2021 | 2021 | 2020-2025 |
| 3465 | 3011.0 | Sam Merrill | 2021.0 | 2022.0 | G | 6-4 | 205.0 | May 15, 1996 | Utah State University | UT | 193.04 | NaN | USA | Utah | UT | USA | 2021 | 2021 | 2020-2025 |
3598 rows × 19 columns
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
g = sns.FacetGrid(PlayersByYears, row="bins", aspect=20, height=0.8)
g.map_dataframe(sns.kdeplot, x="HtCm",fill=True, alpha=1)
g.map_dataframe(sns.kdeplot, x="HtCm", color='black')
g.fig.subplots_adjust(hspace=-0.3)
g.set_titles("")
g.set_ylabels("Density")
g.set(xlabel="Height of players (cm)")
#g.xaxis.get_label()
g.set(yticks=[])
g.despine(left=True)
<seaborn.axisgrid.FacetGrid at 0x17244b799d0>
fig = plt.gcf()
fig.set_size_inches(20.5, 10.5)
plt.bar(range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean()["HtCm"], align='center', width=1)
plt.xticks( range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean().index)
ax = plt.gca()
ax.set(xlabel="Years",ylabel="Height of players (cm)")
#ax.bar(range(len(data)), values,)
ax.set_ylim([185,205])
plt.show()
fig = plt.gcf()
fig.set_size_inches(20.5, 10.5)
#sns.regplot(x=range(0,len(PlayersByYears.groupby("bins").mean().index)),y=PlayersByYears.groupby("bins").mean()["Wt"])
plt.bar(range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean()["Wt"], align='center', width=1)
plt.xticks( range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean().index)
ax = plt.gca()
ax.set(xlabel="Years",ylabel="Weight of players (lbs)")
plt.show()
This tells us that the range of player heights seems to be decreasing over time. Looking at the mean height, it initially increases, then plateaus, and most recently starts to decrease again around the COVID-19 period.
The weight of the players, however, remains relatively constant. This implies that the shorter players are more heavily built, while the taller players are leaner.
Now we take a look at the countries where these players are from.
fig = go.Figure(data=go.Choropleth(
locations = Players.groupby("AlphaThree Code").count()["Player"].index,
z = Players.groupby("AlphaThree Code").count()["Player"],
#text = geo["Country or Area"],
colorscale=[[0, 'rgb(0,0,0)'], [1,'rgb(255,0,0)']],
autocolorscale=True,
reversescale=True,
#marker_line_color='viridis',
marker_line_width=0.5,
colorbar_title = 'Number of Players',
))
fig.show()
It is evident that most players are from the US. This makes sense, as that is where the NBA was founded and where it is most popular. Interestingly, there are also many players from other countries, so let's take a look at the players over time.
sortedGroupeddf = pd.DataFrame(PlayersByYears.groupby(["AlphaThree Code","YearsNum"]).count().reset_index()).sort_values(by="YearsNum")
sortedGroupeddf
| AlphaThree Code | YearsNum | Unnamed: 0 | Player | From | To | Pos | Ht | Wt | Birth Date | Colleges | Country_x | HtCm | Alpha-2 code | Alpha-3 code | Country_y | Alpha code | Years | bins | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 261 | USA | 1947 | 71 | 71 | 71 | 71 | 71 | 71 | 70 | 71 | 66 | 71 | 71 | 32 | 71 | 71 | 71 | 71 | 71 |
| 189 | NLD | 1947 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 262 | USA | 1948 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 22 | 23 | 23 | 15 | 23 | 23 | 23 | 23 | 23 |
| 263 | USA | 1949 | 49 | 49 | 49 | 49 | 49 | 49 | 49 | 49 | 48 | 49 | 49 | 28 | 49 | 49 | 49 | 49 | 49 |
| 264 | USA | 1950 | 59 | 59 | 59 | 59 | 59 | 59 | 59 | 59 | 56 | 59 | 59 | 29 | 59 | 59 | 59 | 59 | 59 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 152 | JAM | 2021 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 37 | BRA | 2021 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 115 | GIN | 2021 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| 188 | NGA | 2021 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 0 | 2 | 2 |
| 166 | LTU | 2021 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
340 rows × 19 columns
import plotly.express as px
fig = px.choropleth(sortedGroupeddf,
locations = 'AlphaThree Code',
color="Player",
animation_frame="YearsNum",
color_continuous_scale="viridis",
#locationmode='USA-states',
#scope="usa",
range_color=(0, 10),
height=600 )
fig.layout.coloraxis.colorbar.title = 'Number of Players'
fig.show()
import plotly.express as px
px.choropleth(pd.DataFrame(PlayersByYears[PlayersByYears["AlphaThree Code"]=="USA"].groupby(["Alpha code","YearsNum"]).count().reset_index()).sort_values(by="YearsNum"),
locations = 'Alpha code',
color="Player",
animation_frame="YearsNum",
color_continuous_scale="viridis",
locationmode='USA-states',
scope="usa",
range_color=(0, 14),
height=600
)
sortedGroupeddf.groupby("YearsNum").nunique()["AlphaThree Code"].plot(xlabel="Years",ylabel = "Number of countries")
<AxesSubplot:xlabel='Years', ylabel='Number of countries'>
We see that over time, the number of countries players come from has increased. Looking at the US alone, many NBA players come from either the east or west coast. This could be because basketball is promoted more heavily in these areas, or simply because most people live near the coasts. Now that we have analyzed the players, let's take a look at the viewers watching the NBA.
We do not have direct access to NBA viewership across the years, but we can use the Google Trends data for basketball. There is one issue: Google Trends reports relative search interest within each country rather than absolute volume, so we correct for this by multiplying by the number of internet users in each country.
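The correction amounts to a simple rescaling; here is a toy sketch with made-up numbers (the real inputs are the Trends scores and the internet-users CSV):

```python
import pandas as pd

# Toy Google Trends scores (relative search share per country) and internet
# user counts; both values here are invented for illustration
df = pd.DataFrame({
    "Country": ["A", "B"],
    "TrendsScore": [100, 10],                  # B searches less intensely...
    "InternetUsers": [1_000_000, 50_000_000],  # ...but has far more users
})

# Approximate absolute interest, then rescale so the maximum is 1
df["Interest"] = df["TrendsScore"] * df["InternetUsers"]
df["Interest"] = df["Interest"] / df["Interest"].max()
print(df[["Country", "Interest"]])  # B ends up with the higher interest
```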
fig = go.Figure(data=go.Choropleth(
locations = geo['Alpha-3 code'],
z = np.log(1+geo["Basketball: (01/01/2004 - 20/09/2022)"])/np.log(1+geo["Basketball: (01/01/2004 - 20/09/2022)"]).max(),
text = geo["Country or Area"],
colorscale="pinkyl",
autocolorscale=True,
reversescale=True,
#marker_line_color='viridis',
marker_line_width=0.5,
colorbar_title = 'Normalized Viewership on logarithmic scale',
))
fig.show()
We see a lot of viewers from the United States, and surprisingly many viewers from China. This makes sense, since the NBA has been deliberately working to capture the Chinese market.
Let's take a look at the interest across time.
overallTime = pd.DataFrame([pd.read_csv("multiTimeline (2).csv").reset_index().loc[2:,"Category: All categories"],pd.to_numeric(pd.read_csv("multiTimeline (2).csv").reset_index().loc[2:,"Unnamed: 1"])]).T
for i in range(0,12*(len(overallTime["Category: All categories"])//12),12):
    yearlyPlot = sns.lineplot(x=range(12), y=overallTime["Unnamed: 1"].iloc[i:i+12], color="blue")
    yearlyPlot.set(xlabel="months", ylabel="normalized interest")
allWeeks = pd.DataFrame(pd.to_numeric(pd.read_csv("multiTimeline_1.csv").reset_index().loc[1:,"Category: Sports"]))
allWeeks.columns = ["2006"]
for i in range(2,18):
    thisWeeks = pd.DataFrame(pd.to_numeric(pd.read_csv("multiTimeline_"+str(i)+".csv").reset_index().loc[1:,"Category: Sports"]))
    thisWeeks.columns = [str(2006+i-1)]
    allWeeks = pd.concat([allWeeks, thisWeeks], axis=1)
allWeeks
hotmap = sns.heatmap(allWeeks.T)
hotmap.set(xlabel="Weeks",ylabel="Years")
plt.show()
We see that across the years, popularity peaks around the 11-week mark. This coincides with the NBA All-Star events that happen every year around this time. Note that each year is normalized separately, so to get the true picture of popularity across the years, we plot the overall timeline of the Google Trends data.
overallTime = pd.DataFrame([pd.read_csv("multiTimeline (2).csv").reset_index().loc[2:,"Category: All categories"],pd.to_numeric(pd.read_csv("multiTimeline (2).csv").reset_index().loc[2:,"Unnamed: 1"])]).T
fig = plt.gcf()
fig.set_size_inches(18.5, 6.5)
plt.plot(overallTime["Unnamed: 1"].iloc[:-33])
plt.plot(overallTime["Unnamed: 1"].iloc[-34:], color="red")
plt.xticks(range(0,len(overallTime["Category: All categories"]),24),[overallTime["Category: All categories"].iloc[i] for i in range(0,len(overallTime["Category: All categories"]),24)])
ax = plt.gca()
ax.set(xlabel="Date", ylabel="normalized interest")
plt.show()
plt = reload(plt)  # reload pyplot to reset global state (e.g. figure-size settings) changed by earlier cells
fig = tsaplots.plot_acf(overallTime["Unnamed: 1"], lags=40)
plt.xlabel("Months Lag")
plt.ylabel("Autocorrelation")
plt.show()
We see the periodic spikes of popularity explained earlier; we also see that the popularity of basketball has been increasing over time. Then, with the COVID-19 pandemic (highlighted in red), there is a sharp drop in popularity. Despite the pandemic, the sport's popularity resumes its rapid growth as time progresses.
From the autocorrelation plot, we observe that the data is correlated with itself at a lag of 12 months. This confirms that the data is periodic and supports the idea that the variation is seasonal within each year.
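To illustrate what the 12-month autocorrelation captures, here is a small sketch on a synthetic monthly series (a sine wave with a 12-month period plus a mild upward trend, standing in for the interest data):

```python
import pandas as pd
import numpy as np

# Synthetic monthly series: yearly seasonality plus a slight trend
months = np.arange(120)
interest = pd.Series(np.sin(2 * np.pi * months / 12) + 0.01 * months)

# Pearson correlation of the series with itself shifted by 12 months;
# a value near 1 indicates strong yearly seasonality
print(round(interest.autocorr(lag=12), 3))
```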
Now let's analyze the salaries that these players earn, and what are the factors affecting them.
sns.set(rc={'figure.figsize':(41.7,8.27)})
sns.boxplot(x = salaryData["Age"],y=salaryData["Salary"])
sns.stripplot(x = salaryData["Age"],y=salaryData["Salary"],jitter = 0.4)
<AxesSubplot:xlabel='Age', ylabel='Salary'>
As players get older, they get paid more. This makes sense: they gain experience, become fan favourites, and get better at the game. However, when players get too old, their pay decreases, likely because performance naturally declines with age. How does pay for all ages change over time?
sns.set(rc={'figure.figsize':(41.7,8.27)})
sns.boxplot(x = salaryData["Season"],y=salaryData["Salary"])
sns.stripplot(x = salaryData["Season"],y=salaryData["Salary"],jitter = 0.4)
<AxesSubplot:xlabel='Season', ylabel='Salary'>
#sns.regplot()
sns.set(rc={'figure.figsize':(15.7,8.27)})
salyear = sns.regplot(x = salaryData.groupby("Season").median().index, y = salaryData.groupby("Season").median()["Salary"])
salyear.set(xlabel = "Year")
[Text(0.5, 0, 'Year')]
We can see that over the years, salaries have generally increased. The range of salaries also increases, as does the maximum salary. This could be due to the rising popularity of the sport at large.
sns.set(rc={'figure.figsize':(41.7,8.27)})
sns.stripplot(x = salaryData["Pos"],y=salaryData["Salary"],jitter=0.3,alpha=0.2)
sns.violinplot(x = salaryData["Pos"],y=salaryData["Salary"],inner="quartile")
<AxesSubplot:xlabel='Pos', ylabel='Salary'>
salaryData.groupby("Pos").median()["Salary"].sort_values().plot(kind="bar",ylabel="Median Salary")
<AxesSubplot:xlabel='Pos', ylabel='Median Salary'>
We see that players who can play multiple positions, such as PG-PF-SG combinations, are paid the most, while single positions such as PG alone, or PG-SG, are not paid as much. Interestingly, for C and PF there seems to be a bimodal distribution, probably because different tiers of players exist within these categories. However, this data has a lot of outliers, so for further analysis we will remove the top and bottom 1%.
lst = [ "Salary"]
salaryData_filtered = salaryData.copy()
for i in lst:
q_low = salaryData_filtered[i].quantile(0.01)
q_hi = salaryData_filtered[i].quantile(0.99)
salaryData_filtered = salaryData_filtered[(salaryData_filtered[i] < q_hi) & (salaryData_filtered[i] > q_low)]
salaryData_filtered = salaryData_filtered.drop(["Unnamed: 0","Unnamed: 30","Trp Dbl"],axis=1)
sns.set(rc={'figure.figsize':(15.7,8.27)})
sns.pairplot(salaryData_filtered[["Age", "G","FG%", "3P%", "2P%", "eFG%", "FT%", "Salary"]])
<seaborn.axisgrid.PairGrid at 0x1725c3baf70>
plt.barh(np.arange(len(np.abs(salaryData_filtered.corr()["Salary"]).sort_values()))[:-2],np.abs(salaryData_filtered.corr()["Salary"]).sort_values().values[:-2])
plt.yticks(np.arange(len(np.abs(salaryData_filtered.corr()["Salary"]).sort_values()))[:-2],np.abs(salaryData_filtered.corr()["Salary"]).sort_values().index[:-2])
ax = plt.gca()
ax.set(xlabel="Correlation Coefficient")
plt.show()
We see that salary has no strong correlation with any single factor, such as 3-point attempts or other performance-related metrics. The strongest predictor is the season which, as explained earlier, could be due to the rising popularity of the sport.
fig = plt.figure(figsize=(20,16))
gs = fig.add_gridspec(5, 8, hspace=0.2, wspace=0)
axes = gs.subplots(sharex=False, sharey=False)
j = 0
for i in salaryData_filtered["Season"].unique():
    j = j + 1
    features = salaryData_filtered[salaryData_filtered["Season"]==i][["Age", "G","FG%", "3P%", "2P%", "eFG%", "FT%","Salary"]].copy()
    mlm = LinearRegression()
    features.dropna(axis=0, inplace=True)
    X_train2, X_test2, y_train2, y_test2 = train_test_split(features[["Age", "G","FG%", "3P%", "2P%", "eFG%", "FT%"]], features["Salary"], test_size=0.2, random_state=0)
    mlm.fit(X_train2, y_train2)
    yhat2 = mlm.predict(X_test2)
    gg = sns.kdeplot(y_test2, color='r', label='Actual Value', ax=axes[j//8, j%8])
    gg = sns.kdeplot(yhat2, color='b', label='Fitted Value', ax=axes[j//8, j%8])
    gg.set(xlabel=None)
    gg.set(ylabel=None)
plt.show()
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
fig = plt.figure(figsize=(20,16))
gs = fig.add_gridspec(5, 8, hspace=0.2, wspace=0)
axes = gs.subplots(sharex=False, sharey=False)
j = 0
for i in salaryData_filtered["Season"].unique():
    j = j + 1
    features = salaryData_filtered[salaryData_filtered["Season"]==i][["Age", "G","FG%", "3P%", "2P%", "eFG%", "FT%","Salary"]].copy()
    mlm = tree.DecisionTreeRegressor(max_depth=4, criterion="squared_error")
    features.dropna(axis=0, inplace=True)
    X_train2, X_test2, y_train2, y_test2 = train_test_split(features[["Age", "G","FG%", "3P%", "2P%", "eFG%", "FT%"]], features["Salary"], test_size=0.2, random_state=0)
    mlm.fit(X_train2, y_train2)
    yhat2 = mlm.predict(X_test2)
    gg = sns.kdeplot(y_test2, color='r', label='Actual Value', ax=axes[j//8, j%8])
    gg = sns.kdeplot(yhat2, color='b', label='Fitted Value', ax=axes[j//8, j%8])
    gg.set(xlabel=None)
    gg.set(ylabel=None)
plt.show()
Neither the multiple linear regression model nor the decision tree approach accurately predicts NBA player salaries for each year. This is probably due to unmodeled factors such as how popular the athletes are, and their performance in each individual game.
Plotting the salary distribution over time, we get this:
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
g = sns.FacetGrid(salaryData_filtered, row="Season", aspect=20, height=0.8)
g.map_dataframe(sns.kdeplot, x="Salary",fill=True, alpha=1)
g.map_dataframe(sns.kdeplot, x="Salary", color='black')
g.fig.subplots_adjust(hspace=-.9)
g.set_titles("")
g.set_ylabels("Density")
#g.xaxis.get_label()
g.set(yticks=[])
g.despine(left=True)
<seaborn.axisgrid.FacetGrid at 0x17262ef3ee0>
Over time, the distribution of salaries spreads out. This is very interesting, and could be due to the sharp rise in popularity of star players like Michael Jordan and LeBron James: teams pay these stars far more, widening the range of the distribution.
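To quantify this spread, one could track a robust dispersion measure such as the interquartile range per season; here is a sketch on hypothetical rows (the real frame is salaryData_filtered):

```python
import pandas as pd

# Toy salaries for two seasons; the values are invented for illustration
sal = pd.DataFrame({
    "Season": [1990, 1990, 1990, 2020, 2020, 2020],
    "Salary": [300_000, 400_000, 500_000, 1_000_000, 8_000_000, 30_000_000],
})

# Interquartile range per season: a robust measure of how spread out
# salaries are, which the ridgeline plot suggests grows over time
q = sal.groupby("Season")["Salary"]
iqr = q.quantile(0.75) - q.quantile(0.25)
print(iqr)
```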
sns.set(rc={'figure.figsize':(15.7,13)})
sns.heatmap(salaryData_filtered.corr(),vmin=-1, vmax=1, cmap="vlag")
<AxesSubplot:>
Unsurprisingly, we note high correlation between the different shooting metrics. There is only weak correlation between other parameters such as "Season", "Salary" and "Age".
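Part of this correlation is definitional: effective field-goal percentage (eFG%) is computed directly from the other shooting counts, weighting threes by 1.5. A small sketch of the standard formula:

```python
def efg_pct(fgm: int, fg3m: int, fga: int) -> float:
    """Effective field-goal percentage: (FGM + 0.5 * 3PM) / FGA.
    Threes are weighted by 1.5 since they are worth 50% more
    than a two-point basket."""
    return (fgm + 0.5 * fg3m) / fga

# A team that makes 40 of 80 shots, 10 of them threes
print(efg_pct(40, 10, 80))  # 0.5625
```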
plt.plot(salaryData_filtered.groupby("Season").mean()["3PA"])
ax = plt.gca()
ax.set(xlabel="years",ylabel="Mean 3 point attempts")
[Text(0.5, 0, 'years'), Text(0, 0.5, 'Mean 3 point attempts')]
plt.plot(salaryData_filtered.groupby("Season").mean()["2PA"])
ax = plt.gca()
ax.set(xlabel="years",ylabel="Mean 2 point attempts")
[Text(0.5, 0, 'years'), Text(0, 0.5, 'Mean 2 point attempts')]
fig, ax = plt.subplots(1,3,figsize=(15,8),dpi=200)
sns.set_theme(style="white", rc={"axes.facecolor": (1, 1, 1, 1)})
plt.suptitle("Link between percentage of each types of shots and winning")
sns.scatterplot(ax = ax[0],data=details,x="FG3A",y="FG3_PCT",hue="VICTORY",style="VICTORY",alpha=0.8)
ax[0].set(ylabel="Three point shots percentage", xlabel="Three point shots attempted")
sns.scatterplot(ax = ax[1],data=details,x="FGA",y="FG_PCT",hue="VICTORY",style="VICTORY",alpha=0.8)
ax[1].set(ylabel="Field goal percentage", xlabel="Field goal attempted")
sns.scatterplot(ax = ax[2],data=details,x="FTA",y="FT_PCT",hue="VICTORY",style="VICTORY",alpha=0.8)
ax[2].set(ylabel="Free throw percentage", xlabel="Free throw attempted")
plt.show()
fig = plt.figure()
#ax = Axes3D(fig)
ax = fig.add_subplot(111, projection='3d')
my_color = details["VICTORY"].unique()
ax.scatter(details['FG3_PCT'][details["VICTORY"]=="Loss"], details['FG_PCT'][details["VICTORY"]=="Loss"], details['FT_PCT'][details["VICTORY"]=="Loss"], c="red", s=60)
ax.scatter(details['FG3_PCT'][details["VICTORY"]=="Win"], details['FG_PCT'][details["VICTORY"]=="Win"], details['FT_PCT'][details["VICTORY"]=="Win"], c="blue", s=60)
ax.set_xlabel("FG3_PCT")
ax.set_ylabel("FG_PCT")
ax.set_zlabel("FT_PCT")
ax.view_init(45,135)
plt.show()
The NBA changed the distance of the three-point line in the 1990s, with the intention of creating higher-scoring games. Almost immediately after this change, the number of 3-point attempts shot up dramatically, while 2-point attempts began decreasing over the years. Critics say this change has made the game less aggressive, but it remains well received (as can be seen from the increased viewership).
Taking a look at the victories, it becomes very apparent that a high number of 3-point attempts helps teams win.
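One way to check this claim is to bin team 3-point attempts and compare win rates across bins; here is a sketch on toy rows (the real frame is `details`, with its FG3A and VICTORY columns):

```python
import pandas as pd

# Toy team-game rows; the attempt counts and outcomes are invented
df = pd.DataFrame({
    "FG3A": [10, 12, 25, 30, 35, 40],
    "VICTORY": ["Loss", "Loss", "Win", "Loss", "Win", "Win"],
})

# Bin attempts, then compute the share of wins within each bin
df["bin"] = pd.cut(df["FG3A"], bins=[0, 20, 50])
win_rate = (df["VICTORY"] == "Win").groupby(df["bin"]).mean()
print(win_rate)  # higher-attempt bin shows the higher win rate here
```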
fig = plt.gcf()
fig.set_size_inches(20.5, 10.5)
plt.bar(range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean()["HtCm"],color=["blue"]*15+["red"], align='center', width=1)
plt.xticks( range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean().index)
ax = plt.gca()
ax.set(xlabel="Years",ylabel="Height of players (cm)")
#ax.bar(range(len(data)), values,)
ax.set_ylim([185,205])
plt.show()
Players' heights tend to increase over time from 1945-1985. Note that the y-axis origin starts at 185 cm to emphasize the differences, as small differences in height are significant. The height then plateaus up until the 2020 bin, where we suddenly see a sharp decrease in the average height.
This was the year when the pandemic hit, and it possibly hindered the outreach of the players, and the pools of players that recruiters select from.
fig = plt.gcf()
fig.set_size_inches(20.5, 10.5)
#sns.regplot(x=range(0,len(PlayersByYears.groupby("bins").mean().index)),y=PlayersByYears.groupby("bins").mean()["Wt"])
plt.bar(range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean()["Wt"],color=["blue"]*15+["red"], align='center', width=1)
plt.xticks( range(0,len(PlayersByYears.groupby("bins").mean().index)), PlayersByYears.groupby("bins").mean().index)
ax = plt.gca()
ax.set(xlabel="Years",ylabel="Weight of players (lbs)")
plt.show()
The story changes, however, when we look at the weight of the players: it remains relatively constant throughout the years, regardless of height trends or the pandemic. This implies that today's players are taller and leaner than players used to be.
import plotly.express as px
fig = px.choropleth(sortedGroupeddf,
locations = 'AlphaThree Code',
color="Player",
animation_frame="YearsNum",
color_continuous_scale="viridis",
#locationmode='USA-states',
#scope="usa",
range_color=(0, 10),
height=600 )
fig.layout.coloraxis.colorbar.title = 'Number of Players'
fig.show()
Note: the colour scale is capped at 10 players to exaggerate the differences between countries other than the U.S.
sns.regplot(x=sortedGroupeddf.groupby("YearsNum").nunique()["AlphaThree Code"].index,y=sortedGroupeddf.groupby("YearsNum").nunique()["AlphaThree Code"].values, order=2,color="Blue")
plt.plot(sortedGroupeddf.groupby("YearsNum").nunique()["AlphaThree Code"].index,sortedGroupeddf.groupby("YearsNum").nunique()["AlphaThree Code"].values)
ax = plt.gca()
ax.annotate('Increase in number of \ncountries players are from', xy=(2015, 14), xytext=(1980, 6),
arrowprops=dict(facecolor='black',
connectionstyle="angle3,angleA=0,angleB=-130"));
Note: this is a quadratic fit meant to illustrate that the number of countries is increasing; it is not an actual trend prediction.
We can see an increase in the number of countries that players come from. According to analysts, this probably reflects the NBA's attempts to appeal to foreign audiences and increase worldwide viewership. This strategy worked remarkably well in the Chinese market: ever since the addition of Yao Ming, Chinese viewership of the sport skyrocketed. More details on this are given in the next question.
fig = go.Figure(data=go.Choropleth(
locations = geo['Alpha-3 code'],
z = np.log(1+geo["Basketball: (01/01/2004 - 20/09/2022)"])/np.log(1+geo["Basketball: (01/01/2004 - 20/09/2022)"]).max(),
text = geo["Country or Area"],
colorscale="pinkyl",
autocolorscale=True,
reversescale=True,
#marker_line_color='viridis',
marker_line_width=0.5,
colorbar_title = 'Normalized Viewership on logarithmic scale',
))
fig.show()
We see that the U.S. is the top viewer of the sport (probably because the NBA originated there). There is also substantial viewership in China; Chinese commenters on Quora attribute this to two main reasons: the NBA's deliberate efforts to capture the Chinese market, and the popularity of Chinese players such as Yao Ming.
The popularity of the sport varies throughout the duration of the year, as well as across the years. Analyzing the popularity during each year itself we see the following:
allWeeks = pd.DataFrame(pd.to_numeric(pd.read_csv("multiTimeline_1.csv").reset_index().loc[1:,"Category: Sports"]))
allWeeks.columns = ["2006"]
for i in range(2,18):
    thisWeeks = pd.DataFrame(pd.to_numeric(pd.read_csv("multiTimeline_"+str(i)+".csv").reset_index().loc[1:,"Category: Sports"]))
    thisWeeks.columns = [str(2006+i-1)]
    allWeeks = pd.concat([allWeeks, thisWeeks], axis=1)
allWeeks
px.imshow(allWeeks.T,labels=dict(x="Week", y="Year", color="Normalized popularity"),aspect="auto")
#hotmap.set(xlabel="Weeks",ylabel="Years")
#plt.show()
We see that across the years, popularity peaks around the 11-week mark. This coincides with the NBA All-Star events that happen every year around this time. Note that each year is normalized separately, so to get the true picture of popularity across the years, we plot the overall timeline of the Google Trends data.
# Create figure
fig = go.Figure()
overallTime = pd.DataFrame([
    pd.read_csv("multiTimeline (2).csv").reset_index().loc[2:, "Category: All categories"],
    pd.to_numeric(pd.read_csv("multiTimeline (2).csv").reset_index().loc[2:, "Unnamed: 1"]),
]).T
fig.add_trace(go.Scatter(
    y=overallTime["Unnamed: 1"].iloc[:-33],
    x=overallTime["Category: All categories"].iloc[:-33],
    name='Interest before covid'))
fig.add_trace(go.Scatter(
    y=overallTime["Unnamed: 1"].iloc[-34:],
    x=overallTime["Category: All categories"].iloc[-34:],
    name='Interest after covid'))
# Set title and axis labels
fig.update_layout(
    title_text="Time series of basketball search interest",
    yaxis_title='Search interest',
    xaxis_title='Date'
)
# Add a range slider
fig.update_layout(
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=6, label="6m", step="month", stepmode="backward"),
                dict(count=1, label="YTD", step="year", stepmode="todate"),
                dict(count=1, label="1y", step="year", stepmode="backward"),
                dict(step="all"),
            ])
        ),
        rangeslider=dict(visible=True),
        type="date"
    )
)
fig.show()
We see the yearly spikes of popularity explained earlier, but we also see that the overall popularity of basketball has been increasing over time. With the onset of the COVID-19 pandemic (the 'Interest after covid' trace), there is a sharp drop in interest. Despite the virus, however, the sport's popularity appears to recover and continue growing rapidly as time progresses.
sns.set(rc={'figure.figsize': (41.7, 8.27)})
fig = go.Figure()
for i in salaryData["Pos"].unique():
    fig.add_trace(go.Violin(
        y=salaryData["Salary"][salaryData["Pos"] == i],
        x=salaryData["Pos"][salaryData["Pos"] == i],
        name=i,
        box_visible=True,
        meanline_visible=True))
fig.show()
salaryData.groupby("Pos").median()["Salary"].sort_values().plot(kind="bar",ylabel="Median Salary")
We can see that the position a player plays greatly affects their salary: players in the PG and SG roles tend to have markedly lower median salaries than those in the SF, PF and C positions.
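The comparison behind this claim is a groupby-median followed by a ratio of the extremes. A sketch with illustrative toy salaries (the figures are made up, not taken from the dataset):

```python
import pandas as pd

# Toy salaries by position, in millions; figures are hypothetical
sal = pd.DataFrame({
    "Pos": ["PG", "PG", "SG", "C", "C", "PF"],
    "Salary": [1.0, 2.0, 1.5, 8.0, 10.0, 9.0],
})

medians = sal.groupby("Pos")["Salary"].median().sort_values()
# Ratio of the lowest median salary to the highest
ratio = medians.iloc[0] / medians.iloc[-1]
print(medians)
print(round(ratio, 2))
```

Running the same two lines on `salaryData` would quantify exactly how much less guards earn relative to frontcourt players.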
sns.set(rc={'figure.figsize': (15.7, 8.27)})
medians = salaryData.groupby("Season").median()
salyear = sns.regplot(x=medians.index[:], y=(medians["Salary"] * inflation_data)[:-1], color='b')
salyear = sns.regplot(x=medians.index[-1:], y=(medians["Salary"] * inflation_data)[-2:-1], color='r')
salyear.set(xlabel="Year")
We see that, even when accounting for inflation, the median salary of an NBA player has increased over time. This could be due to the sport's increasing popularity: with a growing global market (especially in countries like China and Brazil), the NBA can afford to pay its players more.
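The inflation adjustment above is a per-season multiplication by a deflator that converts nominal dollars to a common base year. A minimal sketch with made-up multipliers (both series here are toy numbers, not the report's `inflation_data`):

```python
import pandas as pd

# Nominal median salary per season, in millions (toy numbers)
nominal = pd.Series({1990: 1.0, 2000: 2.5, 2010: 4.0})

# Hypothetical multipliers converting each season's dollars to a common base year
inflation_factor = pd.Series({1990: 2.0, 2000: 1.5, 2010: 1.2})

# Element-wise product aligned on the season index gives real (inflation-adjusted) salaries
real = nominal * inflation_factor
print(real)
```

If `real` still trends upward after this adjustment, the growth is genuine rather than a monetary illusion, which is the point the plot above makes.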
sns.set_theme(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
g = sns.FacetGrid(salaryData_filtered, row="Season", aspect=20, height=0.8)
g.map_dataframe(sns.kdeplot, x="Salary", fill=True, alpha=1)
g.map_dataframe(sns.kdeplot, x="Salary", color='black')
g.fig.subplots_adjust(hspace=-.9)
g.set_titles("")
g.set_ylabels("Density")
g.set(yticks=[])
g.despine(left=True)
We also see that over time, the spread of players' pay increased dramatically. A select few NBA superstars like Kevin Durant, Kobe Bryant, and Yao Ming are paid heavily, while the rest are paid far less. This widens the pay gap between players and could lead to inequality within teams.
During the mid-1990s, the NBA moved the three-point line in, as can be seen in the picture below.
The impact was a sudden jump in the number of three-point attempts and a sudden decrease in two-point attempts: although players are considerably less accurate from three, each made shot is worth more points, and teams took great advantage of this.
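The trade-off is simple expected value: points per attempt equals make probability times point value, so a lower-accuracy three can still outscore a higher-accuracy two. A worked example with illustrative percentages (the 35% and 48% figures are assumptions, not from the dataset):

```python
# Expected points per attempt = make probability x point value
three_pt = 0.35 * 3   # 35% accuracy from three-point range (assumed)
two_pt = 0.48 * 2     # 48% accuracy from two-point range (assumed)

# Despite the lower accuracy, the three-pointer yields more points per attempt
print(three_pt, two_pt)
```

This arithmetic is why shortening the line, which raises three-point accuracy, tilts shot selection so sharply toward the three.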
plt.plot(salaryData_filtered.groupby("Season").mean()["3PA"].loc[:1993], color="Red", label="Before the rule change")
plt.plot(salaryData_filtered.groupby("Season").mean()["3PA"].loc[1993:], color="Blue", label="After the rule change")
plt.xlabel("Season")
plt.ylabel("Average 3-point attempts per player")
plt.legend()
plt.show()
plt.plot(salaryData_filtered.groupby("Season").mean()["2PA"].loc[:1993], color="Red", label="Before the rule change")
plt.plot(salaryData_filtered.groupby("Season").mean()["2PA"].loc[1993:], color="Blue", label="After the rule change")
plt.xlabel("Season")
plt.ylabel("Average 2-point attempts per player")
plt.legend()
plt.show()
The graphs above show this shift quite dramatically: average three-point attempts per player skyrocketed from roughly 2.5 to more than 10 in just a few years, while two-point attempts dropped from nearly 30 to fewer than 20 over the same period. The red portion is before the change was made; the blue portion is after.
fig, ax = plt.subplots(1,3,figsize=(15,8),dpi=200)
sns.set_theme(style="white", rc={"axes.facecolor": (1, 1, 1, 1)})
plt.suptitle("Link between each shot type and winning")
sns.scatterplot(ax=ax[0], data=details, x="FG3A", y="FG3_PCT", hue="VICTORY", style="VICTORY", alpha=0.8)
ax[0].set(ylabel="Three-point shot percentage", xlabel="Three-point shots attempted")
sns.scatterplot(ax=ax[1], data=details, x="FGA", y="FG_PCT", hue="VICTORY", style="VICTORY", alpha=0.8)
ax[1].set(ylabel="Field goal percentage", xlabel="Field goals attempted")
sns.scatterplot(ax=ax[2], data=details, x="FTA", y="FT_PCT", hue="VICTORY", style="VICTORY", alpha=0.8)
ax[2].set(ylabel="Free throw percentage", xlabel="Free throws attempted")
plt.show()
We also see that teams with more three-point attempts and higher three-point accuracy tend to win considerably more often. The same holds, though less strongly, for field goals and free throws.
From this analysis, it is apparent that to improve its ratings, the NBA should draft more international players to draw in a larger audience. It could consider recruiting more players from countries such as India, or from African nations, whose large young populations could be inspired by these players and tune in more.
Teams in the future should also work more on improving players' three-point shooting. Directed training that raises both the number of attempts and the accuracy of those shots could lead to more victories.
In the future, it would be very interesting to analyze players' popularity through their social media presence. Players with larger social media followings are widely believed to be paid more, so relating on-court performance to popularity could be revealing.
Another interesting direction would be how different teams treat and trade their players. Analyzing different coaching techniques and their impact on the overall game could provide useful insights into the dynamics of the NBA.